Heuristics for Fixing Common Errors in Deployed schema.org Microdata
Abstract
Promoted by major search engines such as Google, Yahoo!, Bing, and Yandex, Microdata embedded in web pages, especially using schema.org, has become one of the most important markup languages for the Web. However, deployed Microdata usually contains errors, which limits its practical use. In this paper, we use the WebDataCommons corpus of Microdata extracted from more than 250 million web pages for a quantitative analysis of common mistakes in Microdata provision. Since it is unrealistic to expect data providers to supply clean and correct data, we discuss a set of heuristics that can be applied on the data consumer side to fix many of those mistakes in a post-processing step. We apply those heuristics to provide an improved knowledge base constructed from the raw Microdata extraction.
Similar resources
What the Adoption of schema.org Tells About Linked Open Data
schema.org is a common data markup schema, pushed by large search engine providers such as Google, Yahoo!, and Bing. To date, a few hundred thousand web site providers adopt schema.org annotations embedded in their web pages via Microdata. While Microdata and Linked Open Data are not 100% the same, there are some commonalities which make a joint analysis of the two valuable and reasonable. Prof...
A Computer-Guided Approach to Website Schema.org Design
Schema.org offers web developers the opportunity to enrich a website’s content with microdata and schema.org. For large websites, implementing microdata can take a lot of time. In general, it is necessary to perform two main activities, for which we lack methods and tools. The first consists in designing what we call the website schema.org, which is the fragment of schema.org that is relevan...
HL7 FHIR and Schema.org
Schema.org was developed by a number of major search engine companies such as Bing, Google and Yahoo! as a common vocabulary for marking up web pages. The combination of HTML and Microdata, RDFa 1.1 Lite or JSON-LD enables a well-known set of semantic tags to be added to existing human-readable web pages. Schema.org has been widely adopted by public web sites and multiple extensions have been c...
A Quantitative Analysis of the Use of Microdata for Semantic Annotations on Educational Resources
A current trend in the semantic web is the use of embedded markup formats that aim to semantically enrich web content by making it more understandable to search engines and other applications. The deployment of Microdata as a markup format has increased thanks to the spread of a controlled vocabulary provided by Schema.org. Recently, a set of properties from the Learning Resource Metadata Init...
Enriching Webpages with Semantic Information
This paper proposes a tool to automatically enrich webpages with semantic information by annotating keywords in the document with microdata markup. Two case studies are described and implemented in this paper. The first case study focuses on generating new webpages with microdata and the second case study focuses on enriching existing webpages with microdata. This paper also demonstrates ...